Using RNNs to model syllabic structure discrimination


Gonzalo Garcia-Castro
Nuria Sebastian-Galles
Chiara Santolin

PH3 Meeting, 2024/04/30

Santolin et al. (2024)

  • Chunking the continuous speech stream into relevant linguistic units (phonemes, syllables, words, phrases): major milestone in language acquisition
  • Consensus: syllables are privileged units of early speech perception (Bijeljac-Babic, Bertoncini, and Mehler 1993; Bertoncini et al. 1995)
  • From birth, infants can discriminate between syllables, based on their internal structure (e.g., CV, VC, CVC, CCV).

Syllable chunking and structure classification: mechanisms?

Language universals as Gates to Language (GaLa):

  • Sonority Sequencing Principle (SSP): sonority peaks at syllable nucleus.
  • Maximal Onset Principle (MOP): consonants are preferentially grouped at the onset of a syllable rather than at the coda.

Sonority Sequencing Principle (SSP)

Sonority Sequencing Principle (SSP). Figure from Santolin et al. (2024).

Sonority Sequencing Principle (SSP)

Experimental series involving:

  • Neonates (fNIRS)
  • Infants (HPP)
  • Long-Evans rats (behavioural)
  • Artificial Recurrent Neural Networks (RNNs)

Santolin et al. (2024)

Do infants encode and generalize the internal structure of syllables?

Head-turn Preference Procedure

Familiarization phase

CVC CCV
sam sma
gel gle
pus psu
dor dro
sen sne

Test phase

CVC CCV
sap spa
kos kso

Santolin et al. (2024)

  • Infants encode and generalize CVC and CCV syllabic structures only when familiarized with CVC
  • Generalization occurs regardless of phonetic information

Research question

Are CCV syllables more difficult to process than CVC syllables?

Simulate the experimental outcomes using Recurrent Neural Networks (RNNs)

Neural Networks (NN)

  • Bunch of regression models (nodes) stacked in layers
  • Receive input, generate output
  • Some nodes inform other nodes via connections whose relative importance is determined by weights

Recurrent Neural Networks (RNN)

  • Process sequences that unfold over arbitrary time spans
  • Receive the additional input of their own previous (hidden) state
  • Long used in speech recognition, text processing, and speech generation software (though transformers have now largely superseded them)
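As a minimal sketch of the recurrence itself (plain NumPy, not the actual model used in this work; all weights and dimensions here are illustrative), each step combines the current input with the previous hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 1 input feature (one audio sample per step), 2 hidden units
n_in, n_hidden = 1, 2
W_x = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    """One recurrent update: the new state depends on the input AND the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Unroll over a short input sequence
h = np.zeros(n_hidden)
for x_t in np.array([[0.1], [0.5], [-0.2]]):
    h = rnn_step(x_t, h)

print(h.shape)  # (2,)
```

The recurrent term `W_h @ h_prev` is what lets the network carry information across time steps, which is exactly what a per-sample audio classifier needs.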

Proof of concept

Supervised audio classification task

  • Starting small: CV vs. VC syllables
  • Keep the model as simple (i.e., interpretable) as possible

Magnuson et al. (2020)

Audio processing

  • 7,000 audio recordings: 700 unique syllables \(\times\) 10 speakers
    • 3,500 consonant-vowel (CV)
    • 3,500 vowel-consonant (VC)
  • Input: amplitude envelope (Deloche, Bonnasse-Gahot, and Gervain 2024)
  • Amplitude and duration normalized (via downsampling) across recordings
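One simple way to approximate this preprocessing (an illustrative NumPy sketch; the actual pipeline, window sizes, and target length may differ) is to rectify the waveform, smooth it into an envelope, downsample to a fixed length, and normalize the peak amplitude:

```python
import numpy as np

def amplitude_envelope(signal, win=64):
    """Crude envelope: rectify, then smooth with a moving average."""
    rectified = np.abs(signal)
    kernel = np.ones(win) / win
    return np.convolve(rectified, kernel, mode="same")

def normalize(env, target_len=200):
    """Downsample to a fixed duration and scale the peak amplitude to 1."""
    idx = np.linspace(0, len(env) - 1, target_len).astype(int)
    resampled = env[idx]
    return resampled / resampled.max()

# Toy waveform: 0.5 s of a decaying 200 Hz tone at 16 kHz
sr = 16000
t = np.linspace(0, 0.5, sr // 2)
wave = np.sin(2 * np.pi * 200 * t) * np.exp(-4 * t)

env = normalize(amplitude_envelope(wave))
print(env.shape)  # (200,)
```

Fixing the length and peak amplitude ensures the network sees only the shape of the envelope, not duration or loudness differences between speakers.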

RNN structure

  • 1 input node (receiving one audio sample per time step)
  • 2 × 2 recurrent nodes (two layers of two units)
  • 1 output node (\(\sigma\) activation function), outputs a probability \(\in [0, 1]\)
    • \(\approx 1\): more likely CV; \(\approx 0\): more likely VC
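A Keras definition consistent with this description might look as follows. This is a sketch under the assumption that "2x2" means two stacked `SimpleRNN` layers of two units each, and that the input is a 200-sample envelope; the layer types, sizes, and sequence length are assumptions, not the exact model:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_STEPS = 200  # assumed fixed length of the downsampled envelope

model = keras.Sequential([
    layers.Input(shape=(N_STEPS, 1)),            # 1 input node, one sample per step
    layers.SimpleRNN(2, return_sequences=True),  # first recurrent layer (2 units)
    layers.SimpleRNN(2),                         # second recurrent layer (2 units)
    layers.Dense(1, activation="sigmoid"),       # P(CV): ~1 -> CV, ~0 -> VC
])

model.summary()
```

At this size the whole network has only 21 trainable parameters, which is what keeps it interpretable.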


Model training

  • Optimizer: Adam (learning rate \(= 0.001\))
  • Binary cross-entropy loss function
  • 30 epochs (early stopping at 95% accuracy)
  • Batch size: 16

Tensorflow + Keras
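In Keras, this training configuration might be wired up roughly as follows. This is a sketch: the stand-in model, the toy data, and the accuracy-threshold callback are assumptions, not the exact code:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Minimal stand-in model (see the RNN structure slide)
model = keras.Sequential([
    layers.Input(shape=(200, 1)),
    layers.SimpleRNN(2, return_sequences=True),
    layers.SimpleRNN(2),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # rate as on the slide
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

class StopAtAccuracy(keras.callbacks.Callback):
    """Stop training once training accuracy reaches a threshold (95% here)."""
    def __init__(self, threshold=0.95):
        super().__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("accuracy", 0.0) >= self.threshold:
            self.model.stop_training = True

# Toy random data standing in for the 7,000 envelopes
x = np.random.rand(64, 200, 1).astype("float32")
y = np.random.randint(0, 2, size=(64, 1)).astype("float32")

history = model.fit(x, y, epochs=30, batch_size=16,
                    callbacks=[StopAtAccuracy(0.95)], verbose=0)
```

The custom callback implements the "early stopping at 95% accuracy" criterion directly, since the built-in `EarlyStopping` callback monitors improvement rather than an absolute threshold.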

(Still tweaking things around.)

Results

Preliminary results


Future steps

  • Use spectrograms instead of envelope (e.g., Magnuson et al. 2020)
  • Unsupervised learning: replace output node with output layer (generation of spectrograms)
    • Encoder-decoder: what does the model think stereotypical CV or VC spectrograms look like?
  • More complex syllable structures: CVC, CCV
  • Take a look at embeddings

Future steps

Is the model better at classifying CVCs than CCVs? (Santolin et al. 2024)

  • Yes: complexity of the speech signal? (e.g., CC cluster)
  • No: infants have accumulated more experience with CVC (more frequent) than CCV?
    • If so, can we reproduce the results by manipulating the frequency of each syllabic structure in the model’s input?

References

Bertoncini, Josiane, Caroline Floccia, Thierry Nazzi, and Jacques Mehler. 1995. “Morae and Syllables: Rhythmical Basis of Speech Representations in Neonates.” Language and Speech 38 (4): 311–29.
Bijeljac-Babic, Ranka, Josiane Bertoncini, and Jacques Mehler. 1993. “How Do 4-Day-Old Infants Categorize Multisyllabic Utterances?” Developmental Psychology 29 (4): 711.
Deloche, François, Laurent Bonnasse-Gahot, and Judit Gervain. 2024. “Acoustic Characterization of Speech Rhythm: Going Beyond Metrics with Recurrent Neural Networks.” arXiv Preprint arXiv:2401.14416.
Magnuson, James S, Heejo You, Sahil Luthra, Monica Li, Hosung Nam, Monty Escabi, Kevin Brown, et al. 2020. “EARSHOT: A Minimal Neural Network Model of Incremental Human Speech Recognition.” Cognitive Science 44 (4): e12823.
Santolin, Chiara, Konstantina Zacharaki, Juan Manuel Toro, and Nuria Sebastian-Galles. 2024. “Abstract Processing of Syllabic Structures in Early Infancy.” Cognition 244: 105663.